Hanoi University of Science and Technology
School of Information and Communication Technology
Master Thesis in Data Science
Unified Deep Neural Networks for Anatomical Site
Classification and Lesion Segmentation for Upper
Gastrointestinal Endoscopy
NGUYEN DUY MANH
manh.nd202657m@sis.hust.edu.vn
Supervisor: Dr. Tran Vinh Duc
Hanoi, October 2022
Author’s Declaration
I hereby declare that I am the sole author of this thesis. The results presented in
this work are my own and are not copied from any other work.
STUDENT
Nguyen Duy Manh
Contents
Contents
Abstract
List of Figures
List of Tables
List of Acronyms
1 Introduction 1
1.1 General introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Main contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Outline of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Artificial Intelligence and Machine Learning 3
2.1 Basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Types of learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.1 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.2 Unsupervised learning . . . . . . . . . . . . . . . . . . . . . . 5
2.2.3 Reinforcement learning . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.1 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.1.1 Deep Learning and Neural Networks . . . . . . . . . 7
2.3.1.2 Perceptron . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.1.3 Feed forward . . . . . . . . . . . . . . . . . . . . . . 10
2.3.1.4 Recurrent Neural Network . . . . . . . . . . . . . . . 11
2.3.1.5 Deep Convolutional Network . . . . . . . . . . . . . 11
2.3.1.6 Training a Neural Network . . . . . . . . . . . . . . 11
2.3.2 Convolutional Neural Network . . . . . . . . . . . . . . . . . . 12
2.3.2.1 Image kernel . . . . . . . . . . . . . . . . . . . . . . 13
2.3.2.2 The convolution operation . . . . . . . . . . . . . . . 13
2.3.2.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2.4 Activation function . . . . . . . . . . . . . . . . . . . 16
2.3.2.5 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.3 Fully convolutional network . . . . . . . . . . . . . . . . . . . 18
2.3.4 Some common convolutional network architectures . . . . . . 20
2.3.4.1 VGG . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.4.2 ResNet . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.4.3 DenseNet . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.4.4 UNet . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.5 Vision Transformer . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.5.1 The Transformer . . . . . . . . . . . . . . . . . . . . 23
2.3.5.2 Transformers for Vision . . . . . . . . . . . . . . . . 24
2.3.6 Multi-task learning . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.7 Transfer learning . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.8 Avoid overfitting . . . . . . . . . . . . . . . . . . . . . . . . . 29
3 Methodology 31
3.1 EndoUNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.1 Overall architecture . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.2 Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.3 Segmentation decoder . . . . . . . . . . . . . . . . . . . . . . 33
3.1.4 Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 SFMNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 Overall architecture . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.2 Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.3 Compact generalized non-local module . . . . . . . . . . . . . 37
3.2.4 Squeeze and excitation module . . . . . . . . . . . . . . . . . 37
3.2.5 Feature-aligned pyramid network . . . . . . . . . . . . . . . . 37
3.2.6 Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3 Metrics and loss functions . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4 Multi-task training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 Experiments 42
4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 Data preprocessing and data augmentation . . . . . . . . . . . . . . . 44
4.3 Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5 Conclusion and future work 51
References 52
Abstract
Image processing is a subfield of computer vision concerned with understanding
and extracting information from digital images. There are several applications for image
processing in various fields, including face recognition, optical character recognition,
manufacturing automation inspection, medical diagnostics, and tasks connected to
autonomous vehicles, such as pedestrian detection. In recent years, the deep neural
network has become one of the most popular image processing approaches due to a
number of significant advancements.
The use of machine learning in biomedical applications can be structured into three
main orientations: (1) computer-aided diagnosis, to help physicians make efficient
and early diagnoses, with better harmonization and fewer contradictory diagnoses;
(2) enhancing the medical care of patients with better-personalized therapies; and
(3) improving human wellbeing, for example by analyzing the spread of disease and
social behaviors in relation to environmental factors [1]. In this work, I propose
models for the first orientation, capable of handling multiple simultaneous tasks
pertaining to the upper gastrointestinal (GI) tract. Evaluated on a dataset of 11,469
endoscopic images, the models produced promising results.
List of Figures
2.1 Reinforcement learning components . . . . . . . . . . . . . . . . . . . 6
2.2 Relationship between AI, ML, and DL . . . . . . . . . . . . . . . . . 7
2.3 Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Illustration of a deep learning model [2] . . . . . . . . . . . . . . . . . 9
2.5 Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.6 Architecture of a CNN [3] . . . . . . . . . . . . . . . . . . . . . . . . 13
2.7 Example of convolution operation [4] . . . . . . . . . . . . . . . . . . 14
2.8 Sparse connectivity, viewed from below [2] . . . . . . . . . . . . . . . 15
2.9 Sparse connectivity, viewed from above [2] . . . . . . . . . . . . . . . 15
2.10 Common activation functions [5] . . . . . . . . . . . . . . . . . . . . . 16
2.11 Max pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.12 Average pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.13 Architecture of an FCN [6] . . . . . . . . . . . . . . . . . . . . . . . . 19
2.14 Architecture of VGG16 [7] . . . . . . . . . . . . . . . . . . . . . . . . 20
2.15 A residual block [8] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.16 DenseNet architecture vs ResNet architecture [9] . . . . . . . . . . . . 22
2.17 UNet architecture [10] . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.18 Attention in Neural Machine Translation . . . . . . . . . . . . . . . . 24
2.19 The Transformer - model architecture [11] . . . . . . . . . . . . . . . 25
2.20 Vision Transformer architecture [12] . . . . . . . . . . . . . . . . . . . 25
2.21 Common form of multi-task learning [2] . . . . . . . . . . . . . . . . . 26
2.22 The traditional supervised learning setup . . . . . . . . . . . . . . . . 27
2.23 Transfer learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1 Architecture of EndoUNet . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 VGG19-based shared block . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 ResNet50-based shared block . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 DenseNet121-based shared block . . . . . . . . . . . . . . . . . . . . . 33
3.5 EndoUNet decoder configuration . . . . . . . . . . . . . . . . . . . . 34
3.6 SFMNet architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.7 Grouped compact generalized non-local (CGNL) module [13] . . . . . 37
3.8 A Squeeze-and-Excitation block [14] . . . . . . . . . . . . . . . . . . . 38
3.9 Overview comparison between FPN and FaPN [15] . . . . . . . . . . 38
3.10 Feature alignment module [15] . . . . . . . . . . . . . . . . . . . . . . 39
3.11 Feature selection module [15] . . . . . . . . . . . . . . . . . . . . . . 39
4.1 Demonstration of the upper GI tract . . . . . . . . . . . . . . . . . . . 42
4.2 Some samples in anatomical dataset . . . . . . . . . . . . . . . . . . . 43
4.3 Some samples in lesion dataset . . . . . . . . . . . . . . . . . . . . . . 44
4.4 Some samples in HP dataset . . . . . . . . . . . . . . . . . . . . . . . 44
4.5 Image augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.6 Learning rate in training phase . . . . . . . . . . . . . . . . . . . . . 46
4.7 EndoUnet - Confusion matrix on anatomical site classification task
on a fold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.8 SFMNet - Confusion matrix on anatomical site classification task on
a fold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.9 Confusion matrices on lesion classification task on a fold. . . . . . . . 49
4.10 Some examples of the lesion segmentation task . . . . . . . . . . . . . 50
List of Tables
3.1 Detailed settings of MiT-B2 and MiT-B3 . . . . . . . . . . . . . . . . 36
4.1 Number of images in each anatomical site and lighting mode . . . . . 43
4.2 Accuracy comparison on the three classification tasks . . . . . . . . . 47
4.3 Dice Score comparison on the segmentation task . . . . . . . . . . . . 48
4.4 Number of parameters and speed of models . . . . . . . . . . . . . . . 48
List of Acronyms
GI Gastrointestinal
HP Helicobacter Pylori
AI Artificial Intelligence
ML Machine Learning
DL Deep Learning
NN Neural Network
DNN Deep Neural Network
CNN Convolutional Neural Network
RNN Recurrent Neural Network
MTL Multi-task Learning
RL Reinforcement Learning
Chapter 1
Introduction
1.1 General introduction
The upper GI tract comprises the oral cavity, the esophagus, the stomach, and
the duodenum [16]. Common diseases of the upper GI tract include inflammation
(esophagitis, gastritis), peptic ulcers, and malignancies. Upper GI tract cancers,
including esophageal and gastric cancer, are among the leading malignancies in both
prevalence and mortality [17]. However, the high miss rate of these lesions during
endoscopy remains a major issue worldwide, and especially in developing countries [18].
Therefore, improving the detection rate will increase the chances of patients receiv-
ing timely medical treatment and prolong survival time [19].
Esophagogastroduodenoscopy (EGD) is a diagnostic procedure that visualizes the
upper part of the GI tract down to the duodenum. It is an exploration method that
accurately detects lesions of the GI tract that are difficult to identify with other tools
(biomarkers or imaging). However, a substantial lesion miss rate, defined as a negative
finding on endoscopy in patients diagnosed with lesions within the following three
years, has been reported in the literature. In a study published by Menon et al. [19],
this rate was 11.3% and similar for both esophageal and gastric cancers. In another
paper by Shimodate et al. from a Japanese institution [20], the authors concluded that the miss rate
situation, such as the heterogeneous quality of endoscopy systems, different levels
of experience in technical performance and lesions evaluation of endoscopists, and
lack of patient tolerance of the procedure. Therefore, computer-aided diagnosis is
desirable to help improve the reliability of this procedure.
Deep learning (DL) has gained remarkable success in solving various computer vi-
sion tasks. In recent years, several DL-based methods have been proposed to deal
with EGD-related tasks, such as informative frame screening, anatomical site classi-
fication, gastric lesion detection, and diagnosis. However, previous works often solve
these tasks separately. Therefore, developing computer-aided systems capable of
simultaneously solving all the tasks using separate task-specific DL models
would require substantial memory and computational resources. This makes it difficult
to deploy such systems on low-cost devices. On the other hand, collecting and an-
notating patients’ data for medical imaging analysis is challenging in practice. The
lack of data can significantly reduce the models’ performance.
To address these issues, this thesis proposes two models to simultaneously
solve four EGD-related tasks: anatomical site classification, lesion classification,
HP classification, and lesion segmentation. Each model includes a shared encoder
for learning a common feature representation, followed by four output branches,
one per task. As a result, the models benefit greatly from multi-task training since
they can learn a powerful joint representation from an extensive dataset that com-
bines multiple data sources collected solely for each task. Experiments show that
the proposed models yield promising results in all tasks and achieve competitive
performance compared to single-task models.
1.2 Objectives
This work aims to build unified models to tackle multiple tasks relating to the
upper gastrointestinal tract. The tasks include anatomical site classification, lesion
classification, HP classification and lesion segmentation.
1.3 Main contributions
The main contributions of this study are as follows:
Introduce two unified deep learning-based models to simultaneously solve four
tasks related to the upper GI tract: a CNN-based baseline model and a
Transformer-based model.
Evaluate the proposed methods on a Vietnamese endoscopy dataset.
1.4 Outline of the thesis
The rest of this thesis is organized as follows:
Chapter 2 presents an overview of the concepts of Artificial Intelligence, Machine
Learning, Deep Learning, and related techniques.
Chapter 3 proposes a model to simultaneously solve tasks related to the upper
gastrointestinal tract.
Chapter 4 presents the content of the experiments and the results obtained.
Chapter 5 concludes the thesis.
Chapter 2
Artificial Intelligence and Machine
Learning
2.1 Basic concepts
Over the last few decades, several definitions of Artificial Intelligence (AI) have sur-
faced. John McCarthy [21] defined AI as the science and engineering of making
intelligent machines, especially intelligent computer programs. AI is related to the
similar task of using computers to understand human intelligence, but it is not
limited to biologically observable methods.
The ability to simulate human intelligence distinguishes AI from logic program-
ming in computer languages. In particular, AI enables computers to gain human
intelligence, such as thinking and reasoning to solve problems, communicating via
understanding language and speech, and learning and adapting.
Artificial Intelligence is, in its simplest form, a field that combines computer science
and robust datasets to enable problem-solving. In addition, it includes the subfields
of machine learning and deep learning, which are commonly associated with artificial
intelligence.
AI can be categorized in different ways. This thesis divides AI into two categories
based on its strength: weak AI and strong AI.
Weak AI, also known as Narrow AI, is a sort of AI that has been trained to
do a particular task. Weak AI imitates human perception and aids humanity by
automating time-consuming tasks and data analysis in ways that humans cannot
always perform. This sort of artificial intelligence is more accurately described as
“Narrow”, as it lacks general intelligence and instead possesses intelligence tailored
to a certain field or task. For instance, an AI that is great at navigation is typically
incapable of playing chess, and vice versa. Weak AI helps transform massive amounts
of data into useful information by identifying patterns and generating predictions.
Most of the AIs that we see today are weak AIs, with typical examples such as
virtual assistants (Apple’s Siri or Amazon’s Alexa), Facebook’s news feed, spam
filtering in email management applications (Gmail, Outlook), autonomous vehicles
(Tesla, VinGroup).
Despite its strengths, weak AI also has the potential to cause damage in the event
of a system failure. For instance, a spam filter may misidentify essential emails and
place them in the spam folder, or a self-driving car may cause a traffic accident
owing to miscalculations.
Strong AI consists of both Artificial General Intelligence (AGI) and Artificial
Super Intelligence (ASI). Artificial General Intelligence is a speculative kind of AI
that posits a machine has intelligence equivalent to that of a person and is capable of
self-awareness, problem-solving, learning, and future planning. Similarly, artificial
superintelligence (also known as superintelligence) is a theoretical kind of artificial
intelligence positing that a machine has superior intelligence and capacities to those
of human brains.
Strong AI is currently only a concept, with no practical examples. Nonetheless,
researchers continue to study and pursue avenues of development for this form
of AI.
Machine Learning (ML) is a branch of AI and computer science that focuses on
the use of data and algorithms to imitate the way that humans learn, gradually
improving its accuracy [22].
2.2 Types of learning
Given that the focus of the field of Machine Learning is “learning”, Artificial In-
telligence/Machine Learning employs three broad categories of learning to acquire
knowledge. They are Supervised Learning, Unsupervised Learning, and Reinforce-
ment Learning.
2.2.1 Supervised learning
In supervised learning, the computer is given labeled examples so that for each
input example, there is a matching output value. This strategy is intended to assist
model learning by comparing the output value created by the model with the real
output value to identify errors and then progressively modifying the model to reduce
errors. Supervised learning then employs the learned patterns to predict output
values for never-before-seen data (data not in the training set). For classification
and regression problems, supervised learning has proved to be accurate and fast.
Classification is the process of discovering a function that divides a dataset
into classes according to certain parameters. A computer program is trained
on the training dataset and, based on this training, classifies new data into
various classes. Classification has many use cases, such as spam filtering,
customer behavior prediction, and document classification.
Regression is a method for identifying relationships between dependent and
independent variables. It aids in forecasting continuous variables, such as
Market Trends and Home Prices.
Supervised learning works by modeling the relationships and dependencies between
the target prediction output and the input features, making it feasible to predict
output values for new data based on the associations learned from the datasets.
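As a concrete sketch of the regression case, the snippet below fits a straight line to a handful of labeled examples using the closed-form least-squares solution (the data points here are invented for illustration):

```python
# Labeled training examples: (input, output) pairs, roughly y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
# Closed-form least-squares slope and intercept
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

# Predict the output for a never-before-seen input
prediction = slope * 5.0 + intercept
print(slope, intercept, prediction)
```

The model has "learned" the association between input and output from the labeled examples and can now generalize to new inputs.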
2.2.2 Unsupervised learning
In contrast, the input data are not labeled in unsupervised learning. Unsupervised
learning describes a class of problems that involves using a model to describe or
extract relationships in data. Machines can learn to recognize complex processes
and patterns without human supervision. This method is especially beneficial when
specialists do not know what to search for in the data, and the data itself does not
offer targets. In practice, the amount of unlabeled data is significantly greater than
the amount of labeled data; hence, unsupervised learning algorithms play a crucial
role in machine learning.
Among the many use cases of unsupervised learning, two main problems are often
encountered: clustering, which involves finding groups in data, and density
estimation, which involves summarizing the data distribution.
Clustering: k-means clustering is a well-known technique of this type, where
k is the number of clusters to discover in the data. It shares the same goal
as classification; however, no labels are provided, so the system must make
sense of the data and cluster it on its own.
Density estimation: an example of a density estimation algorithm is Kernel
Density Estimation, which uses small groups of closely related data samples
to estimate the distribution at new points in the problem space.
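A minimal sketch of k-means (Lloyd's algorithm) in one dimension, with invented data and k = 2, illustrates the assignment/update loop:

```python
def kmeans_1d(points, centers, iters=10):
    """Lloyd's algorithm in one dimension."""
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Two obvious groups around 1 and 10; note that no labels are provided
print(kmeans_1d([0.9, 1.1, 1.0, 9.8, 10.2, 10.0], centers=[0.0, 5.0]))
```

The centers converge near 1 and 10, recovering the two groups purely from the structure of the data.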
Due to its complexity and implementation difficulty, this sort of machine learning
is not as popular as supervised learning, although it enables the solution of issues
humans would ordinarily avoid.
2.2.3 Reinforcement learning
Reinforcement learning (RL) describes a class of problems where an agent operates
in an environment and must learn to operate using feedback. According to [23],
reinforcement learning is learning what to do, that is, how to map situations to
actions, so as to maximize a numerical reward signal. The learner is not told which
actions to take but instead must discover which actions yield the most reward by
trying them.
Reinforcement learning has five essential components: the agent, environment, state,
action, and reward. The RL algorithm (called the agent) will periodically improve
by exploring the environment and going through the different possible states. To
maximize the performance, the ideal behavior will be automatically determined by
the agents. Feedback (the reward) is what allows the agent to improve its behavior.
Figure 2.1: Reinforcement learning components
The idea can be translated into the following steps of an RL agent:
1. The agent observes an input state
2. An action is determined by a decision-making function (policy)
3. The action is performed
4. The agent receives a scalar reward or reinforcement from the environment
5. Information about the reward given for that state/action pair is recorded
In RL, there are two types of tasks: episodic and continuous.
Episodic task: a task that has a terminal state. Reaching it creates an episode:
a list of states, actions, rewards, and new states. Video games are a typical
example of this type of task.
Continuous task: in contrast to an episodic task, a continuous task has no
terminal state and never ends. In this case, the agent has to learn how to
choose the best actions while simultaneously interacting with the environment.
For example, a personal assistant robot does not have a terminal state.
Two of the most used algorithms in RL are Monte Carlo and Temporal Difference
(TD) Learning. The Monte Carlo method involves learning from experience.
It learns through sequences of states, actions, and rewards. Suppose our agent is in
state s1, takes action a1, gets a reward of r1, and is moved to state s2; this whole
sequence is one experience. TD learning is a method for predicting the expected
value of a variable across a sequence of states. TD employs a mathematical trick
to substitute complicated reasoning about the future with a simple learning
procedure that yields the same outcomes. Instead of computing the whole future
reward, TD attempts to forecast the combination of the immediate reward and its
own prediction of the future reward at the next moment.
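This idea is captured by the tabular TD(0) value update, V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)); the sketch below applies one such update to hypothetical states and rewards:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: nudge V(s) toward the immediate reward plus
    the discounted prediction of future reward at the next state."""
    td_target = r + gamma * V[s_next]  # immediate reward + predicted future
    td_error = td_target - V[s]        # how wrong the current estimate was
    V[s] += alpha * td_error
    return V

# Value estimates for three hypothetical states, all initialized to zero
V = {"s1": 0.0, "s2": 0.0, "s3": 0.0}
# One experience: in s1, receive reward 1, move to s2
V = td0_update(V, "s1", r=1.0, s_next="s2")
print(V["s1"])  # 0.1
```

Rather than waiting for the whole future reward, the agent updates its estimate for s1 immediately, using its own current prediction for s2.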
2.3 Techniques
2.3.1 Deep Learning
Since AI has been around for a long time, it has a vast array of applications and is
divided into numerous subfields. Deep Learning (DL) is a subset of ML, which is itself a
branch of AI.
The figure below is a visual representation of the relationship between AI, ML, and
DL.
Figure 2.2: Relationship between AI, ML, and DL
2.3.1.1 Deep Learning and Neural Networks
In recent years, Machine Learning has achieved considerable success in AI research,
enabling computers to outperform or come close to matching human performance
in various domains, including facial recognition, speech recognition, and language
processing.
Machine Learning is the process of teaching a computer how to accomplish a task
instead of programming it step-by-step. Upon completion of training, a Machine
Learning system should be able to make precise predictions when presented with
data.
Deep Learning is a subset of Machine Learning that differs in some important
respects from traditional Machine Learning, allowing computers to solve a wide
range of complex, previously intractable problems. As an example of a simple Machine
Learning task, we can predict how ice cream sales will change based on the outdoor
temperature. Making predictions using only a few data features in this way is
relatively simple and can be done using a Machine Learning technique called linear
regression.
However, numerous problems in the real world do not fit into such simplistic frame-
works. Recognizing handwritten numerals is an illustration of one of these difficult
real-world issues. To tackle this issue, computers must be able to handle the wide
diversity of data presentation formats. Each digit from 0 to 9 can be written in an
unlimited number of ways; the size and shape of each handwritten digit can vary
dramatically depending on the writer and the context.
Allowing the computer to learn from previous experiences and comprehend the data
by interacting with it via a system consisting of many layers of concepts is an effective
method for resolving these issues. This strategy enables computers to tackle complex
problems by constructing them from smaller ones. If this hierarchy is represented
as a graph, the graph is formed of many layers and is described as deep [2]. That
is the idea behind neural networks.
A neural network is a model made up of many neurons. Each neuron is an information-
processing unit capable of receiving input, processing it, and giving appropriate
output. Figure 2.3 is the visual representation of a neural network.
Figure 2.3: Neural Network
All neural networks have an input layer, into which data is supplied before passing
through several layers and producing a final prediction at the output layer. In a
deep neural network, there are numerous hidden layers between the input and
output layers; the term Deep in Deep Learning and Deep Neural Networks
refers to this large number of hidden layers, typically more than three, at the
core of these networks.
Neural networks enable computers to learn multi-step programs, in which each
network layer is analogous to the computer's memory state after running another
set of instructions in parallel.
Figure 2.4 illustrates the process of a deep learning model recognizing an image of
a person. For a computer, an image is a set of pixels, and mapping a collection of
pixels to an object’s identity is an extremely complex process. Therefore, attempting
to learn or evaluate this mapping directly appears overwhelming. Instead, deep
learning overcomes this challenge by decomposing the intended complex mapping
into a series of layered simple mappings, each of which is defined by a distinct model
layer. Each layer of the network represents different features, from low-level ones
(edges, corners, contours) to higher-level ones.
Figure 2.4: Illustration of a deep learning model [2]
2.3.1.2 Perceptron
The perceptron is one of the most fundamental concepts in ML and a building
block of neural networks. Invented by Frank Rosenblatt in the mid-20th century,
the perceptron is a linear ML algorithm used for supervised binary classification.
The perceptron consists of three parts:
Input nodes (or one input layer): this is the fundamental component of the
perceptron, accepting the initial data for subsequent processing. Each input
node carries a real-valued number.
Weights and bias: a weight shows the strength of a particular node, and the
bias is a value that shifts the activation function curve up or down.
Activation function: this component maps the input to the required range,
such as (0, 1) or (-1, 1).
Figure 2.5: Perceptron
The perceptron works in these simple steps:
1. All the inputs x are multiplied by their weights w
2. All the multiplied values are added together; the result is called the weighted sum
3. The activation function is applied to the weighted sum
Perceptron is also one of the most straightforward neural network neuron represen-
tations.
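The three steps above can be sketched in a few lines of Python (a minimal, self-contained illustration; the step activation and the AND weights below are chosen by hand for demonstration):

```python
def perceptron(inputs, weights, bias):
    """Single perceptron: weighted sum followed by a step activation."""
    # Steps 1-2: multiply inputs by weights, sum, and add the bias
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Step 3: a step activation maps the sum to a binary class (0 or 1)
    return 1 if weighted_sum >= 0 else 0

# A perceptron computing logical AND (weights and bias set by hand)
and_weights, and_bias = [1.0, 1.0], -1.5
print(perceptron([1, 1], and_weights, and_bias))  # 1
print(perceptron([0, 1], and_weights, and_bias))  # 0
```

In practice the weights and bias are not set by hand but learned from labeled examples, which is the subject of the training procedure described later.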
2.3.1.3 Feed forward
First appearing in the 1950s, the feedforward neural network was the first and
simplest type of artificial neural network. In this network, information is processed in
only one direction - forward - from the input nodes, through hidden nodes, to the
output nodes.
Input layer: contains the neurons responsible for receiving the input, which
is then transmitted to the subsequent layer. The total number of neurons in
the input layer equals the number of variables in the dataset.
Hidden layer: lying between the input and output layers, this layer contains
a large number of neurons that transform the inputs and communicate with
the output layer.
Output layer: this is the final layer, and its structure depends on the model's
design. The output layer produces the predicted value, which is compared
with the desired result during training.
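The forward flow through these three layers can be sketched as a tiny fully connected network; the weights and biases below are arbitrary illustrative values, and ReLU stands in for a generic activation:

```python
def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One fully connected layer: each output neuron takes a weighted
    sum of all inputs plus a bias."""
    return [sum(i * w for i, w in zip(inputs, row)) + b
            for row, b in zip(weights, biases)]

# Input layer: 2 variables -> hidden layer: 3 neurons -> output layer: 1 neuron
x = [1.0, 2.0]
hidden = relu(dense(x, weights=[[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]],
                    biases=[0.0, 0.1, -0.1]))
output = dense(hidden, weights=[[1.0, -1.0, 0.5]], biases=[0.2])
print(output)
```

Information moves strictly forward here: from the input, through the hidden layer, to the output, with no cycles.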
2.3.1.4 Recurrent Neural Network
Recurrent neural network (RNN) introduced another type of node called a recurrent
node. In RNN, the connection between nodes can create a cycle, so the output from
some nodes can affect subsequent input to the same nodes. RNNs are employed
when the output is influenced by the context and order of the input fed to the model.
Some typical use cases of RNNs are text autocompletion, speech recognition, and
handwriting recognition.
2.3.1.5 Deep Convolutional Network
Deep Convolutional Networks (DCNs, or Convolutional Neural Networks - CNNs)
are among the most popular neural networks today. The original concepts behind
DCNs appeared in studies of the visual cortex dating back to 1980 [24]. Research
into the visual cortex shows that some neurons respond to only a small area of the
image, while others respond to larger areas that combine the smaller ones. Also,
some neurons respond to horizontal lines, and others
respond to vertical lines. These observations lead to the idea that the neurons in
the higher layer synthesize features from lower layers, but each neuron only looks at
part of the layer below it, not synthesizing all the information from the lower layer.
Research on the visual cortex inspired Yann LeCun in 1998 to introduce the LeNet
CNN architecture [25], whose core components are two kinds of blocks: convolutional
and pooling.
The architecture of CNN is described in detail in section 2.3.2.
2.3.1.6 Training a Neural Network
The process of training a neural network can be summarized in the following steps:
1. Model initialization: the starting point of the learning process (the initial
hypothesis). The training of a neural network can begin from any location,
so it is usual practice to randomize the initialization, since an iterative
learning process can produce a near-ideal model regardless of the starting
point.
2. Forward propagation: after initializing the model, its performance must be
evaluated. The input will first be transmitted straight to the network layer
to calculate the model’s output. This step is known as forward propagation
because the calculation flow proceeds forward from the input to the output
via the neural network.
3. Calculate the loss function: at this stage, once we have the output of the
neural network, we compare it with the desired output by computing the value
of the loss function, which evaluates the network's ability to generate outputs
as close to the desired values as possible.
4. Back-propagation: this is the essence of neural network training. Back-propagation
tunes the network's parameters based on the loss obtained in the previous
step. The error is propagated from the output layer back to the layers before
it, and adjusting the weights appropriately reduces the error, making the model
reliable. An optimization algorithm finds the weight updates that will hopefully
yield a smaller loss in the next iteration; gradient descent [26] is the technique
commonly used in this step.
5. Iterate until convergence: as the weights are updated by a small delta step
at a time, learning proceeds gradually. Steps 2 to 4 are repeated until a
stopping condition is reached, such as a fixed number of training iterations or
the loss function reaching a threshold.
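The five steps above can be sketched for a minimal one-parameter model, y = w·x, trained with gradient descent on a squared loss. All names and values here are illustrative, not taken from the thesis.

```python
import random

# Step 1: model initialization - start from a random weight.
random.seed(0)
w = random.uniform(-1.0, 1.0)

# Toy dataset generated by the "true" function y = 3x.
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]

lr = 0.05                                  # size of the delta step
for epoch in range(200):                   # Step 5: iterate until convergence
    for x, y_true in data:
        y_pred = w * x                     # Step 2: forward propagation
        loss = (y_pred - y_true) ** 2      # Step 3: loss calculation
        grad = 2 * (y_pred - y_true) * x   # Step 4: back-propagation (dLoss/dw)
        w -= lr * grad                     # gradient-descent update

print(round(w, 3))  # converges close to the true value 3.0
```

With more parameters and layers, step 4 applies the chain rule layer by layer, but the loop structure stays the same.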
2.3.2 Convolutional Neural Network
Goodfellow et al. defined Convolutional Neural Network (CNN, or ConvNet) as “a
specialized kind of neural network for processing data that has a known grid-like
topology. Examples include time-series data, which can be thought of as a 1-D grid
taking samples at regular time intervals, and image data, which can be thought of
as a 2-D grid of pixels” [2]. This model employs a mathematical operation called
convolution, hence the name Convolutional Neural Network.
Why choose a CNN over a feedforward neural network? While a plain neural
network can handle relatively simple data, it produces poor results on complicated
data such as images, where pixels are interdependent. Through the use of filters,
a CNN is able to capture the spatial dependence between image pixels. In addition,
this architecture provides better performance on image data by reducing the number
of parameters and reusing weights, without losing important features of the image.
Figure 2.6 illustrates the architecture of a Convolutional Neural Network.
Figure 2.6: Architecture of a CNN [3]
2.3.2.1 Image kernel
In computer vision, image kernels are useful for image processing techniques. Dif-
ferent image effects, such as outlining, sharpening, blurring, and embossing, are
achieved by applying convolution operation between the image and different ker-
nels. In ML, they can be utilized for feature extraction, which is the process of
extracting the most important bits of input (image in this case).
In a technical sense, an image kernel is merely a matrix that provides the spatial
weight of different image pixels.
2.3.2.2 The convolution operation
Mathematically, convolution is a linear operation that produces a new function
from two existing functions.
The general expression of a convolution is

g(x, y) = (w ∗ f)(x, y) = Σ_{u=−m/2}^{m/2} Σ_{v=−n/2}^{n/2} w(u, v) f(x − u, y − v)    (2.1)

where
g(x, y) is the filtered image,
f(x, y) is the original image,
w is the filter, and
(m × n) is the shape of the filter.
An indispensable component of convolution is the kernel matrix (filter). The anchor
point of the kernel determines the region of the image to be convolved; typically,
the anchor point is the kernel's center. The value of each kernel element acts as a
weighting factor for the value of the corresponding pixel in the region covered by
the kernel.
The kernel matrix is shifted over each pixel of the image, beginning in the upper-left
corner and working its way to the bottom right, positioning the anchor point at
the pixel under consideration. At each displacement, the resulting pixel is computed
using the convolution formula above.
Figure 2.7 illustrates how the filter is applied to an image. The source image is
an 8 × 8 matrix and the convolution filter (kernel) is a 3 × 3 matrix. At each
position, the kernel values are multiplied by the corresponding pixel values and the
products are summed to produce one element of the resulting matrix.
Figure 2.7: Example of convolution operation [4]
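This sliding-window computation can be sketched in plain Python as a "valid"-mode correlation (deep-learning libraries apply the kernel without flipping, which equals Eq. (2.1) with a flipped kernel; the function name and example values are illustrative):

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and sum the
    element-wise products at each position."""
    m, n = len(kernel), len(kernel[0])
    rows = len(image) - m + 1
    cols = len(image[0]) - n + 1
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            out[i][j] = sum(
                kernel[u][v] * image[i + u][j + v]
                for u in range(m) for v in range(n)
            )
    return out

# A 4x4 image and a 2x2 kernel (illustrative values).
image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
kernel = [[1, 0],
          [0, 1]]  # sums each pixel with its lower-right neighbour

print(conv2d_valid(image, kernel))  # → [[2, 4, 6], [0, 2, 4], [6, 0, 2]]
```

Note that the 4 × 4 input shrinks to 3 × 3 because no padding is applied at the borders.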
2.3.2.3 Motivation
Applying convolution in DL is motivated by the fact that it can exploit three ideas
that boost NN efficiency: sparse interactions (or sparse weights), parameter
sharing, and equivariant representation. Additionally, convolution allows for
working with inputs of variable size.
In traditional neural networks, fully connected layers connect all input units to all
output units, which means that every input unit interacts with every output unit.
However, convolutional networks often have sparse interactions (also referred to
as sparse connectivity or sparse weights). This is achieved by making the
kernel smaller than the input.
When processing a picture, for instance, the input may have thousands or millions of
pixels, yet we may detect small, important features such as edges with kernels that
occupy just tens or hundreds of pixels. This implies that fewer parameters must
be stored, which decreases the model’s memory requirements and computational
complexity, and increases its statistical efficiency.
Figures 2.8 and 2.9 are demonstrations of sparse connectivity. In Figure 2.8, only
three outputs are affected by the input unit x3 instead of all five. Similarly, in
Figure 2.9, only three inputs affect the output unit s3.
Figure 2.8: Sparse connectivity, viewed from below [2]
Figure 2.9: Sparse connectivity, viewed from above [2]
Parameter sharing means using the same parameter for more than one function
in a model, unlike traditional neural networks where each component of the weight
matrix is multiplied by only one input. In CNNs, each member of the kernel is
used at every position of the input. Due to the parameter sharing employed by
the convolution operation, rather than learning a distinct set of parameters for
each location, we only learn one set. Parameter sharing helps reduce the storage
requirements of the model.
Equivariance: a function is said to be equivariant if the output changes in the same
manner as the input. Mathematically, a function f(x) is equivariant to a function g
if f(g(x)) = g(f(x)). In the case of convolution, if g is any function that translates
the input, then the convolution function is equivariant to g.
For instance, suppose g is a function that shifts each pixel of the image I by one
pixel to the right, i.e., I′(x, y) = I(x − 1, y) where I′ = g(I). Applying the
transformation g to the image and then the convolution gives the same result as
applying the convolution to I and then the translation g to the output. This means
that when processing images, if the input is moved one pixel to the right, its
representation will also shift one pixel to the right [2].
This property is a result of the particular form of parameter sharing. Since the same
weights are applied to all locations of the image, if an object appears in any of them,
it will be detected regardless of where it is in the image. This trait is highly helpful
in applications like image classification and object detection when the object may
appear more than once or be moving.
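The shift-then-convolve versus convolve-then-shift equality can be checked numerically. The sketch below uses a 1-D signal and circular boundary handling (an assumption made for simplicity; it is not from the text):

```python
def conv1d_circular(signal, kernel):
    """Circular 1-D correlation: each output sums kernel-weighted neighbours."""
    n, k = len(signal), len(kernel)
    return [sum(kernel[u] * signal[(i + u) % n] for u in range(k))
            for i in range(n)]

def shift_right(signal):
    """g: move every element one position to the right (circularly)."""
    return [signal[-1]] + signal[:-1]

x = [1, 4, 2, 8, 5, 7]
w = [1, 2, 1]  # a small smoothing kernel (illustrative)

a = conv1d_circular(shift_right(x), w)   # apply g, then convolve
b = shift_right(conv1d_circular(x, w))   # convolve, then apply g
print(a == b)  # True: convolution is equivariant to translation
```

The same property holds in 2-D, which is why a shifted object produces a correspondingly shifted feature map.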
2.3.2.4 Activation function
In neural networks, the activation function is applied at the end of a layer or
between layers. It helps determine whether or not a neuron will fire. The activation
function applies a nonlinear transformation to the input signal, and the transformed
output is then forwarded to the subsequent layer as input.
So why do we need nonlinear activation functions? They are used to prevent
linearity: without them, a neural network is just a linear regression model (data
would pass through the layers of the model via linear functions only), which is not
enough for more complex tasks that require representing complicated functions.
Figure 2.10 shows some common activation functions.
Figure 2.10: Common activation functions [5]
The formula of the Sigmoid function is f(s) = 1/(1 + exp(−s)). If the input is
large, the function gives an output close to 1; with a very negative input, the
output is close to 0. This function was used widely in the past because of its
convenient derivative. In recent years it has rarely been used due to one basic
drawback:
Sigmoid saturates and kills gradients: a noticeable disadvantage is that when
the input has a large absolute value (very negative or very positive), the gra-
dient of this function will be very close to 0. This means that the coefficients
corresponding to the unit under consideration will almost not be updated.
The ReLU function is a widely used activation function in neural networks today
because of its simplicity. The mathematical formula of ReLU is f(s) = max(0, s).
Some of its advantages are:
The ReLU has proved itself in accelerating the training of neural networks [27].
This acceleration is attributed to the fact that the ReLU is computed almost
instantaneously, and its gradient is also computed extremely fast: the gradient
is 1 if the input is greater than 0, and 0 if the input is less than 0.
Although the ReLU function has no derivative at s = 0, in practice it is
common to define ReLU′(0) = 0, on the grounds that the probability that the
input of a unit is exactly 0 is very small.
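Both functions are one-liners; a minimal sketch (helper names are ours, not from the thesis):

```python
import math

def sigmoid(s):
    """f(s) = 1 / (1 + exp(-s)): squashes inputs into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

def relu(s):
    """f(s) = max(0, s); its gradient is 1 for s > 0 and 0 for s < 0."""
    return max(0.0, s)

print(round(sigmoid(10), 4))   # large input  -> output close to 1
print(round(sigmoid(-10), 4))  # very negative input -> output close to 0
print(relu(-3.2), relu(2.5))   # negative inputs are zeroed, positive pass through
```

Evaluating `sigmoid` at ±10 also illustrates the saturation problem above: the outputs are nearly flat there, so the gradient is nearly zero.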
2.3.2.5 Pooling
According to [2], a typical convolutional network layer comprises three stages. In
the first stage, it executes many parallel convolutions to generate a set of linear
activations. In the second stage, each linear activation is passed via a nonlinear
activation function, such as the Sigmoid or ReLU function. This stage is also known
as the detector stage. Finally, a pooling function is used in the third stage to
further change the layer’s output.
The first and second stages were introduced in sections 2.3.2.2 and 2.3.2.4; in this
section, we discuss the pooling operation.
The pooling layer is responsible for reducing the spatial dimensions of the convolved
feature (the feature map). This dimensionality reduction decreases the processing
power required to handle the data. In addition, it is useful for extracting dominant
features that are rotationally and positionally invariant, which supports the model
training process. The pooling layer summarizes the features present in a region of
the feature map generated by the convolution layer, so subsequent operations are
performed on summarized features rather than on precisely positioned features.
This makes the model more robust against variations in the position of image
features.
17
Similar to convolution, the pooling operation involves sliding a two-dimensional
filter over each channel of the feature map and summarizing the features lying
within the region covered by the filter.
There are numerous kinds of pooling operations, depending on the mechanism used.
Two common pooling operations are max pooling and average pooling.
Max pooling is a pooling operation that selects the maximum element from
the region of the feature map covered by the filter. Thus, the output after
the max pooling layer would be a feature map containing the most prominent
features of the previous feature map. Figure 2.11 shows how the max pooling
works.
Average pooling computes the average of the elements present in the region
of the feature map covered by the filter. Thus, while max pooling gives the
most prominent feature in a particular patch of the feature map, average
pooling gives the average of features present in a patch. The average pooling
is illustrated in Figure 2.12.
Figure 2.11: Max pooling
Figure 2.12: Average pooling
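Both operations can be sketched with a single helper that slides a non-overlapping window over a feature map (the helper name and example values are illustrative):

```python
def pool2d(fmap, size, op):
    """Slide a non-overlapping size x size window over a 2-D feature map
    and summarize each window with op (max or average)."""
    rows, cols = len(fmap) // size, len(fmap[0]) // size
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            window = [fmap[i * size + u][j * size + v]
                      for u in range(size) for v in range(size)]
            row.append(op(window))
        out.append(row)
    return out

fmap = [
    [1, 3, 2, 9],
    [5, 6, 1, 4],
    [7, 2, 8, 1],
    [3, 0, 2, 6],
]
print(pool2d(fmap, 2, max))                        # → [[6, 9], [7, 8]]
print(pool2d(fmap, 2, lambda w: sum(w) / len(w)))  # → [[3.75, 4.0], [3.0, 4.25]]
```

Either way, the 4 × 4 map shrinks to 2 × 2: max pooling keeps the most prominent value of each patch, average pooling its mean.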
2.3.3 Fully convolutional network
Fully convolutional networks (FCNs) are neural networks that perform only
convolutional operations. Equivalently, an FCN is a normal CNN in which the
fully connected layers are replaced by convolutional layers.
A typical CNN is not FCN since it has fully connected layers that are parameter-
dense, i.e., they have a large number of parameters. Converting a CNN to an FCN
is predicated on the fact that the fully connected layers can also be considered
convolutions that cover the entire input region [6].
Here, in a layer, each neuron only connects to a few local neurons in the previous
layers and the weight is shared between neurons. This connection structure is typi-
cally employed when the data can be understood as spatial and the features to be
retrieved are spatially local and equally likely to exist at any input place. The most
common use case for convolutional layers is image datasets.
An FCN usually consists of two parts to obtain the output:
Down-sampling path: this path is used to extract and interpret semantic/contextual
information. In this path, the width and height of the feature maps are gradually
decreased, while their depth is increased.
Up-sampling path: this path is used to recover the precise spatial information.
FCNs additionally use skip connections to retrieve the fine-grained spatial
information that was lost during the downsampling process.
Figure 2.13 shows the architecture of an FCN. The model supports both feedforward
and back-propagation. The FCN uses a 1x1 convolution as the last layer to classify
pixels such that the final output will have the same shape as the input.
Figure 2.13: Architecture of an FCN [6]
2.3.4 Some common convolutional network architectures
2.3.4.1 VGG
VGG stands for Visual Geometry Group. Published in 2014 by Simonyan et al. [28],
VGG is a standard deep convolutional neural network architecture with multiple
layers that has achieved remarkable results: VGG16 achieves almost 92.7% top-5
test accuracy on ImageNet, while VGG19 achieves 92.0% top-5 and 74.5% top-1
accuracy (16 and 19 stand for the number of layers in the network).
The VGG architecture serves as the foundation for innovative object recognition
models. Designed as a deep neural network, VGG performs strongly on numerous
tasks and datasets beyond ImageNet, and it remains one of the most prominent
image recognition architectures.
However, there are two major drawbacks to VGG:
It is slow to train.
The network architecture weights are quite large in terms of disk/bandwidth.
Figure 2.14 is the architecture of the VGG16.
Figure 2.14: Architecture of VGG16 [7]
2.3.4.2 ResNet
Developed by Microsoft in 2015, ResNet is designed to work with hundreds of layers.
There are two problems with very deep architectures:
Vanishing gradient: as the number of network layers increases, the product of
derivatives shrinks until the partial derivatives of the loss function approach zero
and effectively vanish. This problem can be partially solved by using Batch
Normalization.
Degradation: according to the observations in [8], as we increase network
depth, the accuracy gets saturated and sometimes, it even drops.
To address these problems, a simple and efficient concept called the Residual Block
(Figure 2.15) has been proposed. This block ensures that the learning result of a
new layer will be at least as good as the result of the previous layer.
Figure 2.15: A residual block [8]
With some improvements, ResNet has achieved remarkable results: 93.95% top-5
accuracy and 78.25% top-1 accuracy on ImageNet.
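The core idea of the residual block, y = F(x) + x, can be sketched on a plain vector. Here F is an arbitrary stand-in for the block's learned convolutions, not the exact two-layer block of [8]:

```python
def residual_block(x, transform):
    """y = F(x) + x: the identity shortcut guarantees the block can,
    at worst, pass its input through unchanged (when F(x) -> 0)."""
    fx = transform(x)
    return [xi + fi for xi, fi in zip(x, fx)]

x = [1.0, 2.0, 3.0]

# If the learned transformation collapses to zero, the block is the identity,
# so adding such blocks can never make the network's result worse.
zero_f = lambda v: [0.0] * len(v)
print(residual_block(x, zero_f))   # → [1.0, 2.0, 3.0]

# A non-trivial (illustrative) transformation still carries the shortcut signal.
scale_f = lambda v: [0.5 * vi for vi in v]
print(residual_block(x, scale_f))  # → [1.5, 3.0, 4.5]
```

The shortcut also gives gradients a direct path back to earlier layers, which is why it mitigates the vanishing-gradient and degradation problems described above.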
2.3.4.3 DenseNet
DenseNet is a densely connected convolutional network. It is quite similar to
ResNet, but with a fundamental distinction: DenseNet concatenates the outputs
of previous layers with those of subsequent layers, whereas ResNet uses an additive
approach that combines the prior layer (the identity) with the later layer. Figure
2.16 shows the difference between the DenseNet and ResNet architectures.
DenseNet was designed to improve the accuracy of high-level neural networks af-
flicted by the vanishing gradient, which occurs when the distance between the input
and output layers is so great that information is lost before reaching its intended
destination.
2.3.4.4 UNet
The UNet was introduced in [10] for Bio-Medical Image Segmentation. It is a refined
design derived from a fully convolutional neural network that is utilized for fast
and accurate image segmentation. It was given this name because its construction
resembles the letter U.
Figure 2.16: DenseNet architecture vs ResNet architecture [9]
The architecture comprises a contracting path to capture
context and a path that expands symmetrically to enable exact localization. This
structure is similar to the architecture of an encoder-decoder neural network. This
network can be trained end-to-end from very few images and outperforms the prior
best method, which was a sliding-window convolutional network.
Figure 2.17: UNet architecture [10]
In Figure 2.17, the architecture of UNet is presented: each blue box corresponds
to a multi-channel feature map, with the number of channels denoted on top of
the box. White boxes represent copied feature maps, and the arrows denote the
different operations.
The architecture of UNet consists of three sections:
1. The contraction path (encoder): this section consists of multiple contraction
blocks; each block applies two 3 × 3 (filter size) convolution layers to its input,
followed by a 2 × 2 max-pooling layer.
2. The bottleneck: this is a mediator layer between the contraction and expansion
layers. It consists of two 3×3 convolution layers followed by a 2×2 upsampling
layer.
3. The expansion path (decoder): each expansion layer consists of several ex-
pansion blocks. It also receives feature maps from the contraction path. The
input is passed into two 3×3 convolution layers followed by a 2×2 upsampling
layer.
After each encoder block, the number of feature maps doubles so that the architec-
ture can effectively learn complex structures. To ensure symmetry, the number of
feature maps utilized by the convolutional layer is halved after each decoder block.
On the expansion path, feature maps of the appropriate contraction layer are ap-
pended to the input. This action would ensure that the image is reconstructed using
the features learned during the contraction.
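The channel doubling and halving described above can be sketched numerically. The initial 64 channels follow Figure 2.17; the depth parameter and helper name are illustrative:

```python
def unet_channels(first=64, depth=4):
    """Channels double after each encoder block and halve after each
    decoder block, keeping the architecture symmetric."""
    encoder = [first * 2 ** i for i in range(depth)]
    bottleneck = encoder[-1] * 2
    decoder = encoder[::-1]
    return encoder, bottleneck, decoder

enc, mid, dec = unet_channels()
print(enc, mid, dec)  # → [64, 128, 256, 512] 1024 [512, 256, 128, 64]
```

The mirrored channel counts are what make the skip connections work: each decoder stage concatenates a feature map with its same-sized counterpart from the encoder.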
2.3.5 Vision Transformer
2.3.5.1 The Transformer
In neural networks, attention is a technique that simulates cognitive attention. The
effect boosts some parts of the input data while diminishing others, the motivation
being that the network should devote greater attention to the small but crucial
parts. Context determines which parts of the input are more significant than
others.
Initially, the attention technique was used mainly in natural language processing
problems like Neural Machine Translation (NMT), Image Captioning, and Text
Summarization. Figure 2.18 is an example of attention in the problem of translating
between English and French. Brighter cells indicate that a word A from language
E1 "attends" to, i.e., is more correlated with, a word B from language E2.
Before Transformers, most natural language processing tasks, especially machine
translation, used Recurrent Neural Network (RNN) architectures. The weakness of
this method is that it struggles to capture long-range dependencies between words
in a sentence, and training is slow due to sequential input processing.
Figure 2.18: Attention in Neural Machine Translation
Based on the attention technique, in 2017 Google published a concept called the
Transformer in the paper Attention Is All You Need [11]. Similar to RNNs, the
Transformer architecture also uses two parts: an encoder and a decoder. The
difference is that the entire input is fed in simultaneously, so the timestep concept
no longer exists in Transformers. Figure 2.19 shows the architecture of the
Transformer.
2.3.5.2 Transformers for Vision
Due to their excellent performance, Transformers became the model of choice for
various natural language processing tasks. In computer vision, however, CNNs
remained the dominant architecture.
The idea of applying transformers in computer vision achieved a breakthrough when
Dosovitskiy et al. published the Vision Transformer (ViT) in their article [12]. No-
tably, transformers scale better than CNNs: when training larger models on larger
datasets, vision transformers outperform ResNets by a wide margin. Transformers
revolutionized computer vision in a manner comparable to that of network architec-
ture design in natural language processing.
The Vision Transformer combines part of the Transformer architecture with
Multilayer Perceptron (MLP) blocks and aims to solve the image classification
problem. In ViT, the input image is divided into equal-sized patches, which are
then treated as a sequence. The architecture of ViT is shown in Figure 2.20.
Figure 2.19: The Transformer - model architecture [11]
Figure 2.20: Vision Transformer architecture [12]
2.3.6 Multi-task learning
Multi-Task Learning (MTL) is a learning paradigm in machine learning that aims
to improve the model’s generalization performance by pooling examples from sev-
eral related tasks [2]. Following is an explanation of how MTL achieves this: in
supervised learning tasks, adding more training data has a beneficial effect on the
model’s parameters, hence enhancing its generalization. Likewise, when part of a
model is shared between several tasks (assuming that the sharing is justified), the
tasks jointly push that part’s parameters toward good values, often yielding better
generalization. In this setup, each output may be predicted by a different part of
the model, allowing the shared core of the model to generalize across tasks for the
same inputs.
Figure 2.21 illustrates a common form of multi-task learning. This form consists of
two parts:
The generic parameters (the lower layer, h^(shared)), which are shared between
all the tasks and benefit from the pooled data.
The task-specific parameters (the upper layers), which only benefit from examples
of their own tasks.
Figure 2.21: Common form of multi-task learning [2]
With the increased number of examples for the shared parameters compared to the
single-task scenario, the model can achieve improved generalization and tighter
generalization error bounds thanks to the greatly improved statistical strength.
This only occurs if certain assumptions regarding the statistical relationship between
the different tasks hold (i.e., if some of the tasks share a common characteristic).
From the perspective of deep learning, the underlying prior belief is as follows: some
of the elements that explain the variances observed in the data associated with the
various tasks are shared by two or more tasks [2].
2.3.7 Transfer learning
Transfer learning is a form of learning in which a model is initially trained on one
task, and then part or all of the model is used as a starting point for a related task.
It is a good strategy for problems in which there is a task connected to the primary
task of interest, and that related task has a substantial amount of data.
With the traditional supervised learning approach, we need to provide labeled data
of a specific task and domain to train a model performing that task. Figure 2.22
represents how data should relate to the intended task. Briefly explained, a task is
a purpose our model is designed to accomplish, and a domain is the source of the
data.
Figure 2.22: The traditional supervised learning setup
When the training process on domain A is done, we expect that model A performs
well on unseen data of the same task and domain. On the other hand, a new model
B should be trained when new data on another task or domain B is given.
The traditional supervised learning paradigm fails when there is insufficient labeled
data for the task or domain of interest to train a credible model. Transfer learning
comes into play at this point: it enables us to handle these cases by leveraging
labeled data from a related task or domain. As illustrated in Figure 2.23, we
attempt to store the knowledge gained while solving the source task in the source
domain and apply it to the problem of interest.
Figure 2.23: Transfer learning
Transfer learning for computer vision is predicated on the idea that if a model is
trained on a sufficiently big and general dataset, it can successfully serve as a generic
model of the visual world. We may then utilize the acquired feature maps without
having to start from scratch.
Transfer learning differs from multi-task learning in that tasks are learned sequen-
tially, whereas multi-task learning seeks good performance on all relevant tasks
simultaneously and in parallel using a single model.
A predictive model, such as an artificial neural network, can be trained on a huge
stock of car images, and then the model’s weights can be utilized as a starting point
when training on a smaller, more specialized dataset, such as photographs of trucks.
The features that the model has already learned, such as edges and patterns, will
be beneficial for the new related task.
As previously mentioned, transfer learning is especially effective with incrementally
learned models, and an existing model can be utilized as a starting point for con-
tinuing training, such as with deep learning networks.
2.3.8 Avoid overfitting
In ML and DL, overfitting is an important issue to consider. It happens when the
model tries too hard to fit the training data (both normal samples and noise; by
noise, we refer to data points that incorrectly represent the attributes of the data),
to the extent that this negatively impacts the performance of the model on new
data (i.e., poor performance on test data).
There are several methods to avoid this problem:
Validation: when building the model, we must not use the test data, so we need
another way to assess the quality of the model on unseen data. A simple method
is to extract a small subset from the training set and evaluate the model on it.
This small subset is called the validation set; the training set then becomes the
remainder of the original training set. The training error is calculated on this new
training set, and the analogously defined validation error is the error calculated
on the validation set.
Using these concepts, we look for a model in which both the training error and
the validation error are small, so that we may suppose the test error will likewise
be small. The most typical practice is to train a range of models and select the
one with the minimum validation error.
Cross-validation: in many cases, we have only a small amount of data to construct
a model. If too much of the training data is removed as validation data, what
remains is insufficient to build the model, so the validation set must be kept quite
small for the training set to remain sufficiently large. However, this raises another
difficulty: when the validation set is too small, the model may overfit it.
Cross-validation is an enhancement of validation in which only a limited amount
of data goes into each validation set, but the quality of the model is evaluated on
many validation sets. The training set is typically divided into k subsets with no
shared items and roughly equal size. At every run, one of the k subsets is selected
as the validation set, and the model is constructed from the remaining k − 1
subsets. The final quality is determined by averaging the training and validation
errors. This method is often referred to as k-fold cross-validation.
Dropout: dropout is a mechanism for randomly turning units in the network on
and off. Turning a unit off means setting its value to zero, while regular
feedforward and backpropagation computations are carried out during training.
In this manner, the neural network is compelled to learn new data representations
and pathways through which the data must flow. This not only decreases
computation but also helps prevent overfitting.
Batch normalization: an algorithmic method that makes the training of Deep
Neural Networks (DNNs) faster and more stable. It recenters and rescales the
vectors of a hidden layer using the mean and variance of the current batch. This
normalization step is applied right before (or after) the nonlinear function.
Regularization: a method of changing the model slightly to avoid overfitting
while keeping the generalizability of the model (generalizability is its ability to
describe much of the data in both the training and test sets). Specifically, we
attempt to move the solution of the loss-minimization problem to a nearby point:
even though the value of the loss function is somewhat increased, the shift is in a
direction that makes the model simpler. Some common techniques are early
stopping, adding a term that evaluates model complexity to the loss function
(then called a regularized loss function), and l2 regularization.
Data augmentation: one of the best techniques for reducing overfitting. When
the training data set is small, the model does not get a complete view of the data
and becomes too powerful for the amount of data fed into it, so we need to
increase the number of images in the training set to avoid this problem. Data
augmentation techniques generate new, similar training data based on the original
samples. Using these techniques, we diversify the dataset, and the network is
trained on many similar samples seen from different perspectives.
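Two of the techniques above can be sketched in a few lines: a k-fold index split and inverted dropout. Both helpers are illustrative; in practice, libraries such as scikit-learn and PyTorch provide them.

```python
import random

def kfold_indices(n_samples, k):
    """Split indices 0..n-1 into k disjoint folds; on run i, fold i is the
    validation set and the remaining k - 1 folds form the training set."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    return [([idx for j, f in enumerate(folds) if j != i for idx in f], folds[i])
            for i in range(k)]

def dropout(activations, p, training=True):
    """Inverted dropout: zero each unit with probability p during training
    and scale survivors by 1/(1-p); at inference, do nothing."""
    if not training:
        return activations[:]
    keep = 1.0 - p
    return [a / keep if random.random() < keep else 0.0 for a in activations]

# Each of the 3 runs validates on a different third of the 6 samples.
for train_idx, val_idx in kfold_indices(6, 3):
    print(val_idx, sorted(train_idx))

random.seed(42)
print(dropout([0.5, 1.2, -0.3, 0.8], p=0.5))                  # some units zeroed, survivors doubled
print(dropout([0.5, 1.2, -0.3, 0.8], p=0.5, training=False))  # unchanged at inference
```

The 1/(1−p) scaling keeps the expected activation the same between training and inference, which is why no rescaling is needed at test time.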
Chapter 3
Methodology
3.1 EndoUNet
3.1.1 Overall architecture
The proposed EndoUNet takes input from multiple sources. Our network follows the
encoder-decoder architecture, where the encoder can be any arbitrary CNN-based
backbone. The encoder learns shared feature representation to serve multiple tasks.
There are four output branches after the encoder. The first branch is the
segmentation decoder, with many upsampling blocks, which outputs the prediction
mask. The remaining branches consist of three fully-connected neural networks,
each of which serves one classification task. The overall architecture of EndoUNet
is depicted in Fig. 3.1.
Figure 3.1: Architecture of EndoUNet
3.1.2 Encoder
The encoder learns a joint representation from many data sources for multiple
tasks: anatomical site classification, lesion classification, HP classification, and
segmentation. The output of the encoder is a high-level coarse feature map suitable
for the classification tasks. Moreover, this feature map can also be used as the
input to the decoder to predict lesion segmentation masks.
In this work, we design the encoder based on CNN architectures. In particular,
we evaluate three popular CNN backbones: VGG-19 [28], ResNet-50 [8], and
DenseNet-121 [29]. All of these CNN-based backbones are pre-trained on the
ImageNet dataset.
Figure 3.2 shows the shared block based on the VGG-19 architecture. Outputs of
the highlighted layers are passed to the expansion path, and the last layer is the
bottleneck.
Figure 3.2: VGG19-based shared block
Figure 3.3 shows the CNN shared block based on the ResNet-50 architecture. The
outputs of layers conv1, conv2_x, conv3_x, and conv4_x are passed to the decoder
through shortcut connections, and the layer conv5_x is the bottleneck of the
network.
Figure 3.4 shows the DenseNet121-based shared block. The outputs of the first
Convolution layer and Dense Blocks 1, 2, and 3 are passed to the decoder, and
the combination of the last two layers is the bottleneck of the network.
Figure 3.3: ResNet50-based shared block
Figure 3.4: DenseNet121-based shared block
3.1.3 Segmentation decoder
In this branch, we use multiple blocks to upsample the intermediate results and
output the final segmentation mask at the end. Each decoder block consists of a
transposed convolution that halves the number of feature channels, followed by a
batch normalization (BN) layer and a ReLU layer. The resulting feature map is
33
then concatenated with the corresponding feature map from the encoder, followed
by a 3×3 convolution and a second BN layer. Here, BN layers are applied to reduce
the internal covariate shift and the dependence of gradients on the parameter scale.
Figure 3.5 shows the configuration of the decoder.
Figure 3.5: EndoUNet decoder configuration
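The decoder block just described can be sketched in PyTorch as follows. The channel sizes are illustrative assumptions, and the trailing ReLU after the second BN is an assumed common choice rather than something stated above:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One EndoUNet-style decoder block (a sketch; channel sizes are assumptions)."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        # transposed convolution halves the channels and doubles the spatial size
        self.up = nn.Sequential(
            nn.ConvTranspose2d(in_ch, in_ch // 2, kernel_size=2, stride=2),
            nn.BatchNorm2d(in_ch // 2),
            nn.ReLU(inplace=True),
        )
        # after concatenation with the encoder feature map: 3x3 conv + second BN
        self.fuse = nn.Sequential(
            nn.Conv2d(in_ch // 2 + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),  # assumed, not stated in the text
        )

    def forward(self, x, skip):
        x = self.up(x)                     # upsample the coarse feature map
        x = torch.cat([x, skip], dim=1)    # merge with the encoder skip feature
        return self.fuse(x)
```

Stacking such blocks until the input resolution is reached, followed by a 1×1 convolution, would produce the final segmentation mask.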
3.1.4 Classifiers
After passing through the CNN-based shared encoder, we obtain a high-level feature
representation of the input data. This high-level feature map is first passed to a
global average pooling layer, then a shared fully-connected layer, followed by three
separate output branches corresponding to the three classification tasks.
Each output branch consists of a fully-connected layer with 512 neurons, followed
by an output layer. The output layer of the first branch has ten neurons,
representing the ten anatomical sites of the upper GI tract. The second output layer
has six neurons, representing six classes corresponding to five types of lesions and
non-lesion. Finally, the last branch's output layer contains only one neuron, which
decides whether Helicobacter pylori (HP) exists or not. The first two outputs are
activated by the softmax function, while the single-neuron HP output is activated by
the sigmoid function, matching the binary cross-entropy loss used for that task.
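A minimal PyTorch sketch of these classification heads follows. The 512-channel encoder output and the 1024-unit shared fully-connected layer are assumptions (the text does not fix these dimensions), and the sigmoid on the HP head is assumed for consistency with the binary cross-entropy loss:

```python
import torch
import torch.nn as nn

class MultiTaskClassifiers(nn.Module):
    """Sketch of the three classification heads (dimensions are assumptions)."""
    def __init__(self, encoder_channels=512, shared_dim=1024):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                 # global average pooling
        self.shared_fc = nn.Linear(encoder_channels, shared_dim)
        # one branch per task: a 512-neuron hidden layer + an output layer
        self.site_head = nn.Sequential(nn.Linear(shared_dim, 512), nn.ReLU(), nn.Linear(512, 10))
        self.lesion_head = nn.Sequential(nn.Linear(shared_dim, 512), nn.ReLU(), nn.Linear(512, 6))
        self.hp_head = nn.Sequential(nn.Linear(shared_dim, 512), nn.ReLU(), nn.Linear(512, 1))

    def forward(self, feat):                               # feat: (B, C, H, W)
        x = self.gap(feat).flatten(1)                      # (B, C)
        x = torch.relu(self.shared_fc(x))
        site = torch.softmax(self.site_head(x), dim=1)     # 10 anatomical sites
        lesion = torch.softmax(self.lesion_head(x), dim=1) # 5 lesion types + non-lesion
        hp = torch.sigmoid(self.hp_head(x))                # HP present / absent
        return site, lesion, hp
```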
3.2 SFMNet
3.2.1 Overall architecture
SFMNet is a SegFormer-based model [30]. Its overall architecture is shown in
Figure 3.6. The network follows an encoder-decoder design, where the encoder can
be any Mix Transformer (MiT) configuration [30]. The encoder learns the shared
feature representation to serve multiple tasks and generates four intermediate feature
maps, of which the final three are delivered to the CGNL and SE blocks to aggregate
context information before being sent to the module FaPN to generate the prediction
mask. Simultaneously, the output of the last SE block serves as the input to the
classifiers, which are the MLPs.
Figure 3.6: SFMNet architecture
The details of the CGNL, SE, and FaPN modules are discussed in Sections 3.2.3,
3.2.4, and 3.2.5.
3.2.2 Encoder
The encoder of our network is based on the MiT encoder, which is designed in
SegFormer [30]. MiT is basically inspired by ViT [12], but it has some improvements.
Hierarchical feature representation: while ViT only generates one feature map,
MiT is designed to create multi-level features consisting of high-resolution coarse and
low-resolution fine-grained features. The height and width of the feature maps are
halved after each stage.
Overlapped patch merging: in ViT, the patch merging process is designed to
combine non-overlapping image patches. Consequently, it fails to maintain the local
continuity surrounding such patches. Instead, MiT uses an overlapping patch merg-
ing process. The parameters K, S and P are defined where K is the patch size, S
is the stride between two adjacent patches, and P is the padding size.
Efficient self-attention: the self-attention layer is the primary compute bottleneck of
the encoder. Its computational complexity is O(N^2), where N is the sequence length.
By using a reduction ratio R to shorten the key-value sequence, the complexity of the
self-attention mechanism is reduced from O(N^2) to O(N^2 / R).
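The sequence-reduction idea can be sketched as follows. This is a simplified stand-in, not the exact MiT code: the reshape-then-linear reduction follows the SegFormer formulation, but PyTorch's stock multi-head attention is used for the attention step itself:

```python
import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    """Sketch: the key/value sequence is shortened by a factor R before
    attention, so the cost drops from O(N^2) to O(N^2 / R)."""
    def __init__(self, dim, heads, R):
        super().__init__()
        self.R = R
        # sequence reduction: merge R consecutive tokens into one
        self.sr = nn.Linear(dim * R, dim) if R > 1 else nn.Identity()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                  # x: (B, N, C), N divisible by R
        B, N, C = x.shape
        kv = x.reshape(B, N // self.R, C * self.R) if self.R > 1 else x
        kv = self.sr(kv)                   # (B, N/R, C): shorter key/value sequence
        out, _ = self.attn(x, kv, kv)      # queries keep the full length N
        return out
```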
Mix-FFN: ViT introduces the location information via positional encoding (PE).
However, PE has a fixed resolution. Therefore, when the test resolution differs from
the training resolution, the positional code must be interpolated, which frequently
results in decreased accuracy. To alleviate this problem, MiT introduces the Mix-
FFN, which takes into account the effect of zero padding on location information
leakage by using a 3 × 3 conv in the feed-forward network (FFN). Mix-FFN is
formulated as:
x_out = MLP(GELU(Conv_{3×3}(MLP(x_in)))) + x_in    (3.1)
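Eq. (3.1) can be sketched in PyTorch as below. The depth-wise form of the 3 × 3 convolution and the hidden dimension are assumptions (they follow the published SegFormer design, not anything stated here):

```python
import torch
import torch.nn as nn

class MixFFN(nn.Module):
    """Sketch of Mix-FFN: MLP -> 3x3 conv (zero padding leaks position
    information) -> GELU -> MLP, with a residual connection."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        # depth-wise 3x3 conv: assumed, as in the SegFormer design
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, H, W):            # x: (B, N, C) with N = H * W
        h = self.fc1(x)
        B, N, C = h.shape
        h = h.transpose(1, 2).reshape(B, C, H, W)   # back to a 2D feature map
        h = self.dw(h).flatten(2).transpose(1, 2)   # conv, then back to tokens
        h = self.fc2(self.act(h))
        return x + h                        # residual term of Eq. (3.1)
```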
In this work, we evaluate two configurations of MiT, which are MiT-B2 and MiT-B3.
Their detailed settings are described in Table 3.1.
Table 3.1: Detailed settings of MiT-B2 and MiT-B3

Stage 1 (output size H/4 × W/4):
  Overlapping Patch Embedding: K_1 = 7; S_1 = 4; P_1 = 3; C_1 = 64
  Transformer Encoder (MiT-B2): R_1 = 8, N_1 = 1, E_1 = 8, L_1 = 3
  Transformer Encoder (MiT-B3): R_1 = 8, N_1 = 1, E_1 = 8, L_1 = 3

Stage 2 (output size H/8 × W/8):
  Overlapping Patch Embedding: K_2 = 3; S_2 = 2; P_2 = 1; C_2 = 128
  Transformer Encoder (MiT-B2): R_2 = 4, N_2 = 2, E_2 = 8, L_2 = 4
  Transformer Encoder (MiT-B3): R_2 = 4, N_2 = 2, E_2 = 8, L_2 = 4

Stage 3 (output size H/16 × W/16):
  Overlapping Patch Embedding: K_3 = 3; S_3 = 2; P_3 = 1; C_3 = 320
  Transformer Encoder (MiT-B2): R_3 = 2, N_3 = 5, E_3 = 4, L_3 = 6
  Transformer Encoder (MiT-B3): R_3 = 2, N_3 = 5, E_3 = 4, L_3 = 18

Stage 4 (output size H/32 × W/32):
  Overlapping Patch Embedding: K_4 = 3; S_4 = 2; P_4 = 1; C_4 = 512
  Transformer Encoder (MiT-B2): R_4 = 1, N_4 = 8, E_4 = 4, L_4 = 3
  Transformer Encoder (MiT-B3): R_4 = 1, N_4 = 8, E_4 = 4, L_4 = 3
The meanings of the hyper-parameters are:
K_i: the patch size of the overlapping patch embedding
S_i: the stride of the overlapping patch embedding
P_i: the padding size of the overlapping patch embedding
C_i: the number of output channels
L_i: the number of encoder layers
R_i: the reduction ratio of the efficient self-attention
N_i: the number of heads of the efficient self-attention
E_i: the expansion ratio of the feed-forward layer
3.2.3 Compact generalized non-local module
In computer vision, one of the new ways to extend global information without fo-
cusing on improving convolutions is to use scalar operations to directly compute the
correlation on each region of the feature. The non-local module [31] is designed to
calculate the dependency between any two positions of the image. However, this
module lacks the mechanism to model the interaction between positions across chan-
nels. The compact generalized non-local network (CGNL) addresses this problem
simply by flattening the output of the features after going through the linear trans-
form layers, thereby adding separate weights to the channels. However, this increases
the amount of computation. Therefore, the authors have used techniques to divide
the channels into groups and use Taylor expansion to minimize the computational
weight.
Figure 3.7 illustrates the data flow in CGNL module.
Figure 3.7: Grouped compact generalized non-local (CGNL) module [13]
3.2.4 Squeeze and excitation module
Squeeze-and-Excitation Networks (SENets) [14] introduce a building block for CNNs
that improves channel interdependencies at almost no computational cost. The main
idea of the SE block is to add parameters to each channel of a convolutional block so
that the network can adaptively adjust the weighting of each feature map instead
of weighting all channels equally.
Due to its simplicity and efficiency, the SE block can be added to any model.
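A minimal PyTorch sketch of the SE block follows; the reduction ratio of 16 is the common default from [14], not a value fixed by this text:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation sketch: squeeze to per-channel statistics,
    excite to per-channel weights, then rescale the feature map."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: global average pool
        self.fc = nn.Sequential(                       # excitation: bottleneck MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                   # channel-wise reweighting
```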
3.2.5 Feature-aligned pyramid network
In the encoder-decoder architecture, a common problem is the loss of object
information when the image is reconstructed from the encoder output, especially for
small objects. To mitigate this problem, we create skip connections
Figure 3.8: A Squeeze-and-Excitation block [14]
(similar to ResNet) between the encoder and decoder to help the detector to pre-
dict object locations better. The information between the encoder and decoder is
combined simply by addition, as illustrated in Figure 3.9 (on the left). It works,
but this is not an efficient way to combine information. To address this, Huang
et al. [15] proposed the so-called Feature-aligned Pyramid Network (FaPN), which
contains two new modules: one for feature alignment - Feature Alignment Module
(FAM), and another for feature selection - Feature Selection Module (FSM), which
emphasizes lower-level features with rich spatial information.
Figure 3.9: Overview comparison between FPN and FaPN [15]
Feature Alignment Module: due to the recursive use of downsampling opera-
tions, there is foreseeable spatial misalignment between the decoder feature maps
and the corresponding encoder feature maps. Thus, the feature fusion by either
element-wise addition or channel-wise concatenation would harm the prediction
around object boundaries. By using deformable convolution and learnable offset
fields, FAM can adjust its convolutional sample locations. Figure 3.10 illustrates
how FAM works.
Feature Selection Module: instead of simply using a 1 × 1 convolution to reduce
the number of channels for detailed features, FSM assigns weights for feature maps.
Figure 3.10: Feature alignment module [15]
The data flow of FSM is described in Figure 3.11.
Figure 3.11: Feature selection module [15]
The design of FSM is similar to the SE block discussed in section 3.2.4, but FSM has
a skip connection between the input and the scaled feature maps. Another difference
is that SE is usually used in the backbone to enhance feature extraction, while FSM
is used in the decoder to enhance multi-scale feature aggregation. Additionally, the
output of FSM is one of the inputs of the FAM module.
3.2.6 Classifiers
The classifiers of this network are designed similarly to the classifiers of the En-
doUNet introduced in section 3.1.4. The last output of the encoder is processed by
a global average pooling layer and flattened before being sent to the MLPs.
3.3 Metrics and loss functions
Loss functions play an essential role in designing and training complicated deep
learning models because they drive the algorithm's learning process. Metrics are
equally important, as they tell us how good a model is.
In this work, two metrics are used to evaluate the models: accuracy and Dice coef-
ficient.
Accuracy: the fraction of correct predictions over the total number of predictions.
Dice coefficient: a statistic that measures the similarity between two samples.
The loss functions used to train models are Cross Entropy loss and Dice loss.
Cross Entropy loss: Cross-Entropy [32] is defined as a measure of the dif-
ference between two probability distributions for a given random variable or
set of events. It is widely used for classification, but it also works well for
segmentation, because segmentation can be treated as pixel-level classification.
Dice loss: Dice Loss [33] is the loss calculated based on the Dice coefficient.
The formula for Dice loss is as follows:
DiceLoss = 1 − (2 Σ p_true · p_pred) / (Σ p_true^2 + Σ p_pred^2 + ε)    (3.2)
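A sketch of the Dice loss in PyTorch, written as one minus the soft Dice coefficient so that minimizing it maximizes the overlap; the epsilon value and the per-sample averaging are assumptions:

```python
import torch

def dice_loss(p_pred, p_true, eps=1e-6):
    """One minus the soft Dice coefficient, averaged over the batch."""
    p_pred = p_pred.flatten(1)             # (B, H*W): predicted probabilities
    p_true = p_true.flatten(1)             # (B, H*W): ground-truth mask
    inter = (p_pred * p_true).sum(dim=1)
    denom = (p_pred ** 2).sum(dim=1) + (p_true ** 2).sum(dim=1) + eps
    return (1.0 - 2.0 * inter / denom).mean()
```

A perfect prediction drives the loss toward 0, while a completely disjoint prediction gives a loss of 1.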
3.4 Multi-task training
The data used to train our models come from many sources, and we merge them
to make an extensive training dataset. In this new dataset, each sample is only
related to some of the tasks. We denote µ_i^t ∈ {0, 1} as the sample type indicator,
where t ∈ {pos, le, hp, seg} corresponds to the anatomical site classification, lesion
type classification, HP classification, and lesion segmentation tasks, respectively.
Here µ_i^t = 1 if the i-th sample is associated with the t-th task, and µ_i^t = 0
otherwise. Suppose that y_i^t is the one-hot encoding of the correct class label of
the i-th sample in the t-th task, and let ŷ_i^t be the probabilistic output of the
i-th sample in the t-th task. If t ∈ {pos, le, hp}, then y_i^t and ŷ_i^t are vectors
whose length equals the number of classes, which is either C_pos = 10, C_le = 6, or
C_hp = 1, where C_pos is the number of anatomical sites, C_le is the number of lesion
classes (five lesion types plus the negative class), and C_hp corresponds to the
binary decision of whether HP exists. If t = seg, then y_i^t and ŷ_i^t are 2D
matrices of the same size as the input images. Note that in the segmentation task,
all lesions of different types are merged into a common lesion class; the
segmentation task is therefore binary, distinguishing lesion pixels from background
ones.
The loss L_pos for the anatomical site classification task is a multi-class
cross-entropy loss defined as follows:

L_pos = − Σ_{i=1}^{N} µ_i^{pos} Σ_{j=1}^{C_pos} y_i^{pos}(j) log ŷ_i^{pos}(j)    (3.3)

where N is the number of training samples.
The loss L_le for the lesion type classification task is another multi-class
cross-entropy loss defined as follows:

L_le = − Σ_{i=1}^{N} µ_i^{le} Σ_{j=1}^{C_le} y_i^{le}(j) log ŷ_i^{le}(j)    (3.4)
The loss L_hp for HP classification is the binary cross-entropy loss defined as
follows:

L_hp = − Σ_{i=1}^{N} µ_i^{hp} [ y_i^{hp} log ŷ_i^{hp} + (1 − y_i^{hp}) log(1 − ŷ_i^{hp}) ]    (3.5)
The loss L_seg is the primary loss driving the learning process of the lesion
segmentation task. It is defined as the sum of the binary cross-entropy loss and the
Dice loss:

L_seg = Σ_{i=1}^{N} µ_i^{seg} [ BCE(y_i^{seg}, ŷ_i^{seg}) + DICE(y_i^{seg}, ŷ_i^{seg}) ]    (3.6)
The total loss function for training is a weighted sum of the component loss
functions, as shown below:

L_total = λ_1 L_pos + λ_2 L_le + λ_3 L_hp + λ_4 L_seg    (3.7)

where λ_t indicates the importance level of the t-th task. In this work, we set
λ_1 = λ_2 = λ_3 = λ_4 = 1.
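The per-sample masking by µ can be sketched as below. The helper names are hypothetical, and only the classification terms of the total loss are shown (the segmentation term combines the same masked BCE with the Dice loss):

```python
import torch
import torch.nn.functional as F

def masked_ce(logits, labels, mu):
    """Multi-class cross-entropy of Eqs. (3.3)/(3.4): each sample's term is
    weighted by its indicator mu, so unrelated samples contribute nothing."""
    return (mu * F.cross_entropy(logits, labels, reduction="none")).sum()

def masked_bce(pred, target, mu):
    """Binary cross-entropy of Eq. (3.5), masked the same way."""
    return (mu * F.binary_cross_entropy(pred, target, reduction="none")).sum()

# Total loss of Eq. (3.7) with lambda_1 = ... = lambda_4 = 1 (names hypothetical):
# L_total = masked_ce(site_logits, site_y, mu_pos) \
#         + masked_ce(lesion_logits, lesion_y, mu_le) \
#         + masked_bce(hp_pred, hp_y, mu_hp) + L_seg
```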
Chapter 4
Experiments
4.1 Datasets
The training data used in this work are real data collected from endoscopy
examinations of patients at the Institute of Gastroenterology and Hepatology and
Hanoi Medical University Hospital. There are three sub-datasets: one for anatomical
site classification, another for lesion segmentation and classification, and the
last for HP classification. They are combined into one large dataset for training
and testing.
Figure 4.1: Demonstration of the upper GI tract
Anatomical site dataset
This dataset includes 5546 images of 10 anatomical sites, all of which are captured
directly from the endoscopic machine, including four lighting modes: WLI (White
Light Imaging), FICE (Flexible spectral Imaging Color Enhancement), BLI (Blue
Light Imaging), and LCI (Linked Color Imaging). The images in this dataset do
not contain any lesions and have labels specifying the anatomical site. Table 4.1
describes the details of this dataset.
Table 4.1: Number of images in each anatomical site and lighting mode
Anatomical site WLI FICE BLI LCI TOTAL
Pharynx 177 134 120 119 550
Esophagus 169 141 116 127 553
Cardia 163 120 132 140 555
Gastric body 174 135 124 120 553
Gastric fundus 170 130 126 128 554
Gastric antrum 155 143 131 125 554
Greater curvature 171 131 126 125 553
Lesser curvature 155 140 134 126 555
Duodenum bulb 156 141 135 128 560
Duodenum 163 138 127 131 559
Figure 4.2: Some samples in anatomical dataset
Lesion dataset
In this dataset, we have 4104 images of 5 types of lesions: reflux esophagitis,
esophageal cancer, gastritis, stomach cancer, and duodenal ulcer. The images in
this dataset have the annotations for both the classification and segmentation tasks.
The numbers of images for the reflux esophagitis, esophageal cancer, gastritis,
stomach cancer, and duodenal ulcer classes are 1335, 538, 1443, 538, and 250,
respectively.
Figure 4.3 shows some samples in the lesion dataset.
HP dataset
We have 1819 images in this dataset, including HP-positive and HP-negative images.
Figure 4.4 shows some samples in the HP dataset.
Figure 4.3: Some samples in lesion dataset
Figure 4.4: Some samples in HP dataset
4.2 Data preprocessing and data augmentation
Given that the images come from multiple sources and have different sizes, they are
first resized to 480×480 before being fed into the model for training.
Deep learning models usually require large amounts of data to reach good
performance. Hence, we use data augmentation techniques to generate more data during
the training phase. The training data is augmented on the fly with a probability of
0.5 (i.e., each image has a 50% chance of being augmented every time it is selected
for training).
The following techniques are used:
Horizontal flip
Vertical flip
Rotate
Shift
Zoom in/out
Motion blur
Hue saturation
Figure 4.5 illustrates how an image transforms after applying augmentation tech-
niques.
Figure 4.5: Image augmentation
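The on-the-fly policy above can be sketched as follows. Only the flip and rotation transforms are shown; the remaining transforms (shift, zoom, motion blur, hue-saturation) would be applied in the same probabilistic fashion:

```python
import random
import torch

def random_augment(img: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    """Sketch of the on-the-fly augmentation: with probability p the image is
    transformed, otherwise it is returned unchanged."""
    if random.random() >= p:
        return img                            # unchanged (1 - p) of the time
    if random.random() < 0.5:
        img = torch.flip(img, dims=[-1])      # horizontal flip
    if random.random() < 0.5:
        img = torch.flip(img, dims=[-2])      # vertical flip
    k = random.randint(0, 3)
    img = torch.rot90(img, k, dims=[-2, -1])  # rotation in 90-degree steps
    return img
```

Because the transform is drawn fresh at every epoch, the model effectively never sees the exact same batch twice.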
4.3 Implementation details
Experimental environment: the experiments were conducted in the environment
with the following specifications:
OS: Ubuntu 20.04.3 LTS 64 bit
RAM: 128GB
CPU: AMD Ryzen 3970X 3.7GHz 32 cores / 64 threads
GPU: NVIDIA RTX 3090 24GB
Framework: the models are implemented in PyTorch, a Python framework that tracks
all computations performed on learnable weights and provides many common neural
network modules, making it easier to implement and monitor models.
Dataset preparation: the models are evaluated using a 5-fold cross-validation
scheme. We split each dataset into five subfolds, and each fold of the combined
dataset is then created by merging subfolds from the individual datasets. In detail,
in the anatomical site dataset, each subfold contains the same number of images for
each anatomical site and lighting mode. In the lesion dataset, each subfold contains
the same number of images of each lesion type. In the HP dataset, each subfold
contains the same number of HP-positive and HP-negative samples. Finally, we create
a marker vector µ to indicate the sample type.
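The stratified fold construction can be sketched as follows; `stratified_folds` is a hypothetical helper, and round-robin dealing per stratum is an assumed implementation that yields equal per-class counts in each fold:

```python
import random
from collections import defaultdict

def stratified_folds(samples, key, n_folds=5, seed=0):
    """Group samples by their stratum (e.g. anatomical site + lighting mode)
    and deal each group round-robin into folds, so every fold gets the same
    per-stratum count."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for s in samples:
        groups[key(s)].append(s)
    folds = [[] for _ in range(n_folds)]
    for members in groups.values():
        rng.shuffle(members)                 # randomize within each stratum
        for i, s in enumerate(members):
            folds[i % n_folds].append(s)     # round-robin assignment
    return folds
```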
Dynamic learning rate: in the training phase, linear warmup and cosine annealing
are used to update the learning rate. The minimum learning rate is 10^-6, and the
maximum is 10^-3. The learning rate increases rapidly in the first two epochs
(warmup) and gradually decreases in the subsequent epochs following the cosine
function. Figure 4.6 describes the change in learning rate during training.
Figure 4.6: Learning rate in training phase
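The warmup-plus-cosine schedule can be sketched as a function of the epoch; updating at epoch granularity (rather than per step) is an assumption:

```python
import math

def lr_at(epoch, total_epochs, warmup_epochs=2, lr_min=1e-6, lr_max=1e-3):
    """Linear warmup for the first epochs, then cosine annealing back to lr_min."""
    if epoch < warmup_epochs:
        # linear ramp from lr_min up to lr_max
        return lr_min + (lr_max - lr_min) * epoch / warmup_epochs
    # cosine decay from lr_max down to lr_min over the remaining epochs
    t = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```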
4.4 Experimental results
We perform the following experiments to validate our model:
An ablation study to evaluate the impact of three popular CNN backbones on
EndoUNet: VGG19, ResNet-50, and DenseNet-121.
An ablation study to evaluate the impact of two MiT configurations on
SFMNet: MiT-B2 and MiT-B3.
In the classification tasks, we train single-tasking instances of the models,
including VGG19, ResNet50, DenseNet121, and MiT-B3, each trained on its
separate data. We then compare the performance of the multi-tasking models
against these single-tasking models.
In the lesion segmentation task, we train five single-tasking instances of UNet
and five single-tasking instances of SFMNet, each trained on the data of a
separate lesion type. Then, we compare the performance of the multi-tasking
models versus the single-tasking instances.
To demonstrate the efficacy of transfer learning, two multi-tasking instances
are trained without pre-trained parameters for comparison with models that
employed them.
We will discuss models using pre-trained parameters first. Models that do not use
pre-trained parameters will be discussed later.
The effect of different backbones on model accuracy is analyzed in Table 4.2. There
is typically no significant variation in the performance of the models across
datasets. On the two datasets, anatomical site and lesion, the models perform
rather well. The lowest accuracies for the anatomical site classification, lesion
classification, and HP classification tasks are 97.07%, 98.51%, and 91.21%,
respectively, obtained by the single-tasking VGG19. The results of the multi-tasking
models are generally superior to those of the single-tasking models. SFMNet with
MiT-B3 as the backbone and EndoUNet with ResNet50 as the backbone are the models
with the best performance on the three tasks: anatomical site classification at
98.46%, lesion classification at 99.63%, and HP classification at 93.46%.
Table 4.2: Accuracy comparison on the three classification tasks
Method Backbone Anatomical site classification Lesion classification HP classification
VGG19 (cls only) 97.07 ± 0.29% 98.51 ± 0.69% 91.21 ± 1.27%
Resnet50 (cls only) 97.53 ± 0.29% 98.79 ± 1.09% 91.87 ± 1.08%
DenseNet121 (cls only) 97.65 ± 0.29% 99.16 ± 0.76% 91.81 ± 1.23%
MiT-B3 (cls only) 97.58 ± 0.86% 99.45 ± 0.36% 91.43 ± 1.15%
EndoUNet
VGG19 98.09 ± 0.30% 99.58 ± 0.44% 93.13 ± 1.02%
ResNet50 98.00 ± 0.49% 99.63 ± 0.26% 93.46 ± 0.83%
ResNet50 (no pretrained) 95.86 ± 0.35% 99.16 ± 0.41% 93.12 ± 0.71%
DenseNet121 98.28 ± 0.50% 99.44 ± 0.75% 93.19 ± 1.14%
SFM-based
MiT-B2 98.30 ± 0.31% 99.11 ± 0.76% 93.35 ± 0.79%
MiT-B3 98.46 ± 0.41% 99.54 ± 0.65% 93.29 ± 0.82%
MiT-B3 (no pretrained) 91.26 ± 0.41% 99.00 ± 0.26% 93.21 ± 0.83%
The results of the segmentation task are shown in Table 4.3. In the majority of
tests, the multi-tasking model outperforms its single-tasking counterpart, except
for the SFMNet model on the duodenal ulcer and stomach cancer datasets. The fact
that the combination of MiT-B3, CGNL, SE, and FaPN achieves the greatest Dice
scores across all datasets demonstrates its effectiveness in the segmentation task.
Figure 4.10 shows several examples of segmentation results.
Table 4.4 reports the number of parameters and the speed of the models. Despite
having more parameters and heavier workloads, the multi-tasking models attain
speeds comparable to those of the single-tasking ones. SFMNet with the MiT-B3
backbone is the slowest of all the models at 21 FPS, indicating that the models can be
Table 4.3: Dice Score comparison on the segmentation task
Method Backbone
Reflux
esophagitis
Esophageal
cancer
Duodenal
ulcer
Gastritis
Stomach
cancer
UNet (seg only) ResNet50 0.457 ± 0.011 0.807 ± 0.005 0.709 ± 0.021 0.444 ± 0.057 0.854 ± 0.021
SFM-based (seg only) MiT-B3 0.515 ± 0.002 0.839 ± 0.012 0.737 ± 0.008 0.477 ± 0.051 0.896 ± 0.009
EndoUNet
VGG19 0.462 ± 0.014 0.807 ± 0.006 0.648 ± 0.024 0.419 ± 0.048 0.851 ± 0.009
ResNet50 0.464 ± 0.006 0.819 ± 0.009 0.676 ± 0.024 0.443 ± 0.065 0.860 ± 0.009
ResNet50 (no pretrained) 0.318 ± 0.006 0.727 ± 0.009 0.488 ± 0.024 0.243 ± 0.065 0.798 ± 0.009
DenseNet121 0.474 ± 0.008 0.824 ± 0.007 0.670 ± 0.014 0.457 ± 0.066 0.866 ± 0.014
SFM-based
MiT-B2 0.493 ± 0.021 0.837 ± 0.012 0.704 ± 0.025 0.476 ± 0.074 0.885 ± 0.007
MiT-B3 0.517 ± 0.007 0.847 ± 0.012 0.723 ± 0.009 0.502 ± 0.072 0.892 ± 0.008
MiT-B3 (no pretrained) 0.312 ± 0.033 0.695 ± 0.024 0.338 ± 0.019 0.193 ± 0.053 0.758 ± 0.009
enhanced to serve real-time applications.
Table 4.4: Number of parameters and speed of models
Method Backbone
Number of parameters
(million)
Speed (FPS)
VGG19 (cls only) 20.3 36
Resnet50 (cls only) 26.6 33
DenseNet121 (cls only) 7.5 26
MiT-B3 (cls only) 44.8 24
UNet (seg only) ResNet50 38.5 32
SFM-based (seg only) MiT-B3 45.6 21
EndoUNet
VGG19 26.2 34
ResNet50 41.2 31
DenseNet121 17.1 24
SFM-based
MiT-B2 26.4 24
MiT-B3 46.3 21
Figures 4.7 and 4.8 show the confusion matrices of EndoUNet and SFMNet on one fold
of the anatomical site classification task. For both models, the accuracy of the
gastric body class is the lowest, at 88.07%, and it is easily confused with other
stomach classes, including gastric fundus, gastric antrum, greater curvature, and
lesser curvature.
Figure 4.9 depicts the confusion matrices for the lesion classification task. On this
dataset, the models perform quite well.
For the models that do not use pre-trained parameters, in the classification tasks,
as shown in Table 4.2, they reach relatively good results on the lesion
classification and HP classification tasks, but their accuracy on anatomical site
classification is quite low compared to the other models. In the segmentation
tasks, the performance drops significantly without pre-trained parameters. Thus, we
Figure 4.7: EndoUnet - Confusion matrix on anatomical site classification task on
a fold.
Figure 4.8: SFMNet - Confusion matrix on anatomical site classification task on a
fold.
Figure 4.9: Confusion matrices on lesion classification task on a fold.
can see that transfer learning helps to train the models faster, thereby saving the
resources needed.
Figure 4.10: Some examples of the lesion segmentation task
Chapter 5
Conclusion and future work
In this work, the author proposed two unified models to solve four tasks for EGD
images: anatomical site classification, lesion classification, HP classification, and lesion
segmentation. The proposed models are jointly trained on the mixed data derived
from multiple sources. The multi-task learning forces the models to learn a powerful
unified representation across all the tasks and gain significant benefits. Overall, the
proposed models achieve high accuracy in the classification tasks while still yielding
competitive results compared to the single-task models trained separately.
This work evaluated the effectiveness of the popular backbones and contextual in-
formation aggregation modules. In addition, the proposed models are constructed
in a modular way, making it simple to test the integration of information processing
blocks.
Future work for this study could involve evaluating the model using several metrics
and comparing the current model to numerous other classification and segmentation
models. It may also involve comparing the model’s impact using several medical
datasets. This study can be further improved by finding an effective way to combine
loss functions instead of simply adding them together.
References
[1] R. Zemouri, N. Zerhouni, and D. Racoceanu, “Deep learning in the biomedical
applications: Recent and future status,” Applied Sciences, vol. 9, no. 8, p. 1526,
2019.
[2] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press, 2016.
[3] MathWorks, “What is a convolutional neural network?,” URL:
https://www.mathworks.com/discovery/convolutional-neural-network-
matlab.html.
[4] A. Dertat, “Applied deep learning - part 4: Convolutional neural networks,”
URL: https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2, 2017.
[5] S. Jadon, “Introduction to different activation functions for deep learning,”
URL: https://medium.com/@shrutijadon/survey-on-activation-functions-for-deep-learning-9689331ba092, 2018.
[6] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for seman-
tic segmentation,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, pp. 3431–3440, 2015.
[7] M. Ferguson, R. Ak, Y.-T. T. Lee, and K. H. Law, “Automatic localization of
casting defects with convolutional neural networks,” in 2017 IEEE international
conference on big data (big data), pp. 1726–1735, IEEE, 2017.
[8] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recog-
nition,” in Proceedings of the IEEE conference on CVPR, pp. 770–778, 2016.
[9] PluralSight, “Introduction to densenet with tensorflow,” URL:
https://www.pluralsight.com/guides/introduction-to-densenet-with-tensorflow.
[10] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for
biomedical image segmentation,” in MICCAI, pp. 234–241, Springer, 2015.
[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
L. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural
information processing systems, vol. 30, 2017.
[12] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Un-
terthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image
is worth 16x16 words: Transformers for image recognition at scale,” arXiv
preprint arXiv:2010.11929, 2020.
[13] K. Yue, M. Sun, Y. Yuan, F. Zhou, E. Ding, and F. Xu, “Compact generalized
non-local network,” Advances in neural information processing systems, vol. 31,
2018.
[14] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings
of the IEEE conference on computer vision and pattern recognition, pp. 7132–
7141, 2018.
[15] S. Huang, Z. Lu, R. Cheng, and C. He, “Fapn: Feature-aligned pyramid network
for dense image prediction,” in Proceedings of the IEEE/CVF International
Conference on Computer Vision, pp. 864–873, 2021.
[16] P. M. Treuting, M. J. Arends, and S. M. Dintzis, “11 - upper gastrointestinal
tract,” in Comparative Anatomy and Histology (Second Edition) (P. M. Treut-
ing, S. M. Dintzis, and K. S. Montine, eds.), pp. 191–211, San Diego: Academic
Press, second edition ed., 2018.
[17] H. Sung, J. Ferlay, R. L. Siegel, M. Laversanne, I. Soerjomataram, A. Jemal,
and F. Bray, “Global cancer statistics 2020: Globocan estimates of incidence
and mortality worldwide for 36 cancers in 185 countries,” CA: a cancer journal
for clinicians, vol. 71, no. 3, pp. 209–249, 2021.
[18] A. R. Pimenta-Melo, M. Monteiro-Soares, D. Libânio, and M. Dinis-Ribeiro,
“Missing rate for gastric cancer during upper gastrointestinal endoscopy: a
systematic review and meta-analysis,” European journal of gastroenterology &
hepatology, vol. 28, no. 9, pp. 1041–1049, 2016.
[19] S. Menon and N. Trudgill, “How commonly is upper gastrointestinal cancer
missed at endoscopy? a meta-analysis,” Endoscopy international open, vol. 2,
no. 02, pp. E46–E50, 2014.
[20] Y. Shimodate, M. Mizuno, A. Doi, N. Nishimura, H. Mouri, K. Matsueda, and
H. Yamamoto, “Gastric superficial neoplasia: high miss rate but slow progres-
sion,” Endoscopy International Open, vol. 5, no. 08, pp. E722–E726, 2017.
[21] J. McCarthy, “What is artificial intelligence,” URL:
http://www-formal.stanford.edu/jmc/whatisai.html, 2004.
[22] IBM Cloud Education, “What is machine learning?,” URL:
https://www.ibm.com/cloud/learn/machine-learning, 2020.
[23] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT
press, 2018.
[24] D. H. Hubel and T. N. Wiesel, “Receptive fields, binocular interaction and
functional architecture in the cat’s visual cortex,” The Journal of physiology,
vol. 160, no. 1, p. 106, 1962.
[25] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning
applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11,
pp. 2278–2324, 1998.
[26] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv
preprint arXiv:1609.04747, 2016.
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with
deep convolutional neural networks,” Communications of the ACM, vol. 60,
no. 6, pp. 84–90, 2017.
[28] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-
scale image recognition,” in ICLR (Y. Bengio and Y. LeCun, eds.), 2015.
[29] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely
connected convolutional networks,” in Proceedings of the IEEE conference on
CVPR, pp. 4700–4708, 2017.
[30] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “Seg-
former: Simple and efficient design for semantic segmentation with transform-
ers,” Advances in Neural Information Processing Systems, vol. 34, pp. 12077–
12090, 2021.
[31] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in
Proceedings of the IEEE conference on computer vision and pattern recognition,
pp. 7794–7803, 2018.
[32] M. Yi-de, L. Qing, and Q. Zhi-Bai, “Automated image segmentation using
improved pcnn model based on cross-entropy,” in Proceedings of 2004 Inter-
national Symposium on Intelligent Multimedia, Video and Speech Processing,
2004., pp. 743–746, IEEE, 2004.
[33] C. H. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M. Jorge Cardoso, “Gen-
eralised dice overlap as a deep learning loss function for highly unbalanced
segmentations,” in Deep learning in medical image analysis and multimodal
learning for clinical decision support, pp. 240–248, Springer, 2017.